The Generalized Spike Process, Sparsity, and 
Statistical Independence 

Naoki Saito 
Department of Mathematics 
University of California 
Davis, CA 95616 USA 
Email: saito@math.ucdavis.edu 

Abstract 

A basis under which a given set of realizations of a stochastic
process can be represented most sparsely (the so-called best
sparsifying basis (BSB)) and the one under which such a set becomes
the least statistically dependent (the so-called least
statistically-dependent basis (LSDB)) are important for data
compression and have generated interest among computational
neuroscientists as well as applied mathematicians. Here we consider
these bases for a particularly simple stochastic process called the
"generalized spike process", which puts a single spike, whose
amplitude is sampled from the standard normal distribution, at a
random location in the zero vector of length $n$ for each realization.

Unlike the "simple spike process", which we dealt with in our
previous paper and whose amplitude is constant, we need to consider
the kurtosis-maximizing basis (KMB) instead of the LSDB due to the
difficulty of evaluating the differential entropy and mutual
information of the generalized spike process. By computing the
marginal densities and moments, we prove that: 1) the BSB and the KMB
both select the standard basis if we restrict our basis search to all
possible orthonormal bases in $\mathbb{R}^n$; 2) if we extend our basis
search to all possible volume-preserving invertible linear
transformations, then the BSB exists and is again the standard basis,
whereas the KMB does not exist. Thus, the KMB is rather sensitive to
the orthonormality of the transformations under consideration, whereas
the BSB is insensitive to it. Our results once again support the
preference of the BSB over the LSDB/KMB for data compression
applications, as our previous work did.



1 Introduction 

This paper is a sequel to our previous paper [?], where we considered the
so-called best sparsifying basis (BSB) and the least statistically-dependent
basis (LSDB) for input data that are realizations of a very simple
stochastic process called the "spike process." This process, which we will
refer to as the "simple" spike process for convenience, puts a unit impulse
(i.e., its amplitude is the constant 1) at a random location in a zero vector
of length $n$. Here, the BSB is the basis in $\mathbb{R}^n$ that best sparsifies
the given input data, and the LSDB is the basis in $\mathbb{R}^n$ that is the
closest to the statistically independent coordinate system (regardless of
whether such a coordinate system exists or not). In particular, we considered
the BSB and LSDB chosen from all possible orthonormal transformations (i.e.,
$O(n)$) or all possible volume-preserving linear transformations (i.e.,
$SL^{\pm}(n, \mathbb{R})$, where any element in this set has determinant $\pm 1$).

In this paper, we consider the BSB and LSDB for a slightly more compli- 
cated process, the "generalized" spike process, and compare them with those 
of the simple spike process. The generalized spike process puts an impulse 
whose amplitude is sampled from the standard normal distribution N(0, 1). 

Our motivation to analyze the BSB and the LSDB for the generalized
spike process stems from the work in computational neuroscience [17], [18],
[?] as well as in computational harmonic analysis [?]. The concept of
sparsity and that of statistical independence are intrinsically different.
Sparsity emphasizes the issue of compression directly, whereas statistical
independence concerns the relationship among the coordinates. Yet, for
certain stochastic processes, these two are intimately related and often
confused. For example, Olshausen and Field [17], [18] emphasized sparsity as
the basis selection criterion, but they also assumed the statistical
independence of the coordinates. For a set of natural scene image patches,
their algorithm generated basis functions efficient at capturing and
representing edges of various scales, orientations, and positions, which are
similar to the receptive field profiles of the neurons in our primary visual
cortex. (Note the criticism raised by Donoho and Flesia [?] about the trend
of referring to these functions as "Gabor"-like functions; therefore, we just
call them "edge-detecting" basis functions in this paper.) Bell and Sejnowski
[?] used the statistical independence criterion and obtained basis functions
similar to those of Olshausen and Field. They claimed that they did not
impose sparsity explicitly and that such sparsity emerged by minimizing the
statistical dependence among the coordinates. These observations motivated us
to study these two criteria. However, the mathematical relationship between
these two criteria in the general case has not been understood completely. We
wish to deepen our understanding of this intricate relationship. Therefore,
we chose to study such spike processes, which are much simpler than natural
scene images viewed as a high-dimensional stochastic process. It is important
to use simple stochastic processes first since we can gain insights and make
precise statements in terms of theorems. By these theorems, we now understand
the precise conditions under which the sparsity and statistical independence
criteria select the same basis for the spike processes, and the difference
between the simple and generalized spike processes.

The organization of this paper is as follows. The next section specifies our
notation and terminology. Section 3 defines how to quantitatively measure
the sparsity and statistical dependence of a stochastic process relative to a
given basis. Section 4 reviews the results on the simple spike process we
obtained in [?]. Our main results are presented in Section 5, where we deal
with the generalized spike process. We conclude with a discussion in Section 6.



2 Notations and Terminology 

Let us first set our notation and terminology. Let $X \in \mathbb{R}^n$ be a
random vector with some unknown probability density function (pdf) $f_X$. Let
$B \in \mathcal{D}$, where $\mathcal{D}$ is the so-called basis dictionary. For
very high-dimensional data, we often use the wavelet packets and local Fourier
bases as $\mathcal{D}$ (see [20] and references therein for more about such
basis dictionaries). In this paper, however, we use much larger dictionaries:
$O(n)$ (the group of orthonormal transformations in $\mathbb{R}^n$) or
$SL^{\pm}(n, \mathbb{R})$ (the group of invertible volume-preserving
transformations in $\mathbb{R}^n$, i.e., their determinants are $\pm 1$).
We are interested in searching for a basis under which the original stochastic
process becomes either the sparsest or the least statistically dependent among
the bases in $\mathcal{D}$. Let $\mathcal{C}(B \mid X)$ be a numerical measure of
deficiency or cost of the basis $B$ given the input stochastic process $X$.
Under this setting, the best basis for the stochastic process $X$ among
$\mathcal{D}$ relative to the cost $\mathcal{C}$ is written as
$B_\star = \arg\min_{B \in \mathcal{D}} \mathcal{C}(B \mid X)$.

We also note that $\log$ in this paper means $\log_2$, unless stated otherwise.
The $n \times n$ identity matrix is denoted by $I_n$, and the $n \times 1$ column
vector whose entries are all ones, i.e., $(1, 1, \ldots, 1)^T$, is denoted by
$\mathbf{1}_n$.

3 Sparsity vs. Statistical Independence 

Let us now define the measure of sparsity and that of statistical independence 
to evaluate a given basis (coordinate system). 

3.1 Sparsity 

Sparsity is a key property of a good coordinate system for compression. The
true sparsity measure for a given vector $x \in \mathbb{R}^n$ is the so-called
$\ell^0$ quasi-norm, which is defined as

$$ \|x\|_0 = \#\{ i \in [1, n] : x_i \neq 0 \}, $$

i.e., the number of nonzero components in $x$. This measure is, however,
very unstable under even small perturbations of the components of a vector.
Therefore, a better measure is the $\ell^p$ norm:

$$ \|x\|_p = \Big( \sum_{i=1}^n |x_i|^p \Big)^{1/p}, \quad 0 < p \le 1. $$

In fact, this is a quasi-norm for $0 < p < 1$ since it does not satisfy the
triangle inequality, but only satisfies weaker conditions:
$\|x + y\|_p \le 2^{-1/p'} (\|x\|_p + \|y\|_p)$, where $p'$ is the conjugate
exponent of $p$; and $\|x + y\|_p^p \le \|x\|_p^p + \|y\|_p^p$.
It is easy to show that $\lim_{p \downarrow 0} \|x\|_p^p = \|x\|_0$. See [?] for
the details of the $\ell^p$ norm properties.

Thus, we can use the expected $\ell^p$ norm minimization as a criterion to find
the best basis for a given stochastic process in terms of sparsity:

$$ \mathcal{C}_p(B \mid X) = E \| B^{-1} X \|_p^p. \tag{1} $$

We propose to use the minimization of this cost to select the best sparsifying
basis (BSB):

$$ B_p = \arg\min_{B \in \mathcal{D}} \mathcal{C}_p(B \mid X). $$

Remark 3.1. It should be noted that the minimization of the $\ell^p$ norm can
also be carried out for each realization. Without taking the expectation in
(1), one can select the BSB $B_p = B_p(x, \mathcal{D})$ for each realization $x$.
We can guarantee that

$$ \min_{B \in \mathcal{D}} \mathcal{C}_p(B \mid X = x) \le \min_{B \in \mathcal{D}} \mathcal{C}_p(B \mid X) \le \max_{B \in \mathcal{D}} \mathcal{C}_p(B \mid X = x). $$

For highly variable or erratic stochastic processes, however, $B_p(x, \mathcal{D})$
may change significantly for each $x$, and we need to store more information
about this set of $N$ bases (one per realization) if we want to use them to
compress the entire training dataset. Whether we should adapt a basis per
realization or on the average is still an
open issue. See [?] for more details.
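
As a small numerical illustration of the cost (1), the following sketch averages the $\ell^p$ cost of the coefficients $B^{-1}x$ over a finite set of realizations. It is only a sketch under stated assumptions: the function names `lp_cost` and `expected_lp_cost`, the toy data (the four unit impulses in $\mathbb{R}^4$), and the comparison basis are illustrative choices, not part of the original development.

```python
def lp_cost(coeffs, p):
    """Return sum_i |y_i|^p; for p == 0 this counts the nonzero entries."""
    if p == 0:
        return float(sum(1 for y in coeffs if y != 0))
    return sum(abs(y) ** p for y in coeffs)


def expected_lp_cost(realizations, analysis, p):
    """Average l^p cost of y = analysis * x over all realizations.

    `analysis` stands for B^{-1} and is stored as a list of rows.
    """
    total = 0.0
    for x in realizations:
        y = [sum(a * xi for a, xi in zip(row, x)) for row in analysis]
        total += lp_cost(y, p)
    return total / len(realizations)


# Toy data (an assumption for illustration): the four unit impulses in R^4.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# A symmetric orthonormal comparison basis (a scaled 4 x 4 Walsh matrix);
# being symmetric and orthogonal, it serves as its own analysis matrix.
walsh = [[0.5 * s for s in row] for row in
         [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]]

for p in (0, 0.5, 1):
    print(p, expected_lp_cost(identity, identity, p),
          expected_lp_cost(identity, walsh, p))
```

For every $p$ tried here the standard basis gives the smaller average cost, because each impulse has a single nonzero coefficient in it while the comparison basis spreads each impulse over all four coordinates.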



3.2 Statistical Independence 

The statistical independence of the coordinates of $Y \in \mathbb{R}^n$ means
$f_Y(y) = f_{Y_1}(y_1) f_{Y_2}(y_2) \cdots f_{Y_n}(y_n)$, where $f_{Y_k}$ is a
one-dimensional marginal pdf of $f_Y$. Statistical independence is a key
property of a good coordinate system for compression and particularly
modeling because: 1) damage to one coordinate does not propagate to the
others; and 2) it allows us to model the $n$-dimensional stochastic process of
interest as a set of 1D processes. Of course, in general, it is difficult to
find a truly statistically independent coordinate system for a given
stochastic process. Such a coordinate system may not even exist for a certain
stochastic process. Therefore, the next best thing we can do is to find the
least statistically dependent coordinate system within a basis dictionary.
Naturally, then, we need to measure the "closeness" of a coordinate system
$Y_1, \ldots, Y_n$ to statistical independence. This can be measured by the
mutual information or relative entropy between the true pdf $f_Y$ and the
product of its marginal pdf's:

$$ I(Y) \triangleq \int f_Y(y) \log \frac{f_Y(y)}{\prod_{i=1}^n f_{Y_i}(y_i)} \, dy = -H(Y) + \sum_{i=1}^n H(Y_i), $$

where $H(Y)$ and $H(Y_i)$ are the differential entropies of $Y$ and $Y_i$,
respectively:

$$ H(Y) = -\int f_Y(y) \log f_Y(y) \, dy, \qquad H(Y_i) = -\int f_{Y_i}(y_i) \log f_{Y_i}(y_i) \, dy_i. $$

We note that $I(Y) \ge 0$, and $I(Y) = 0$ if and only if the components of $Y$
are mutually independent. See [7] for more details of the mutual information.

Suppose $Y = B^{-1} X$ and $B \in GL(n, \mathbb{R})$ with $\det B = \pm 1$. We
denote the set of such matrices by $SL^{\pm}(n, \mathbb{R})$. Note that the usual
$SL(n, \mathbb{R})$ is a subset of $SL^{\pm}(n, \mathbb{R})$. Then, we have

$$ I(Y) = -H(Y) + \sum_{i=1}^n H(Y_i) = -H(X) + \sum_{i=1}^n H(Y_i), $$

since the differential entropy is invariant under such an invertible
volume-preserving linear transformation, i.e.,

$$ H(B^{-1} X) = H(X) + \log |\det B^{-1}| = H(X), $$

because $|\det B^{-1}| = 1$. Based on this fact, we proposed the minimization
of the following cost function as the criterion to select the so-called least
statistically-dependent basis (LSDB) in the basis dictionary context [20]:

$$ \mathcal{C}_H(B \mid X) = \sum_{i=1}^n H\big( (B^{-1} X)_i \big) = \sum_{i=1}^n H(Y_i). \tag{2} $$

Now, we can define the LSDB as

$$ B_{LSDB} = \arg\min_{B \in \mathcal{D}} \mathcal{C}_H(B \mid X). $$



We were informed that Pham [19] had proposed the minimization of the same
cost (2) earlier. We would like to point out the main difference between our
work [20] and Pham's. We used basis libraries such as wavelet packets and
local Fourier bases that allow us to deal with datasets of large dimension,
such as face images, whereas Pham used the more general dictionary
$GL(n, \mathbb{R})$. In practice, however, the numerical optimization of (2)
clearly becomes more difficult in his general case, particularly if one wants
to use it for high-dimensional datasets.

Closely related to the LSDB is the concept of the kurtosis-maximizing
basis (KMB). This is based on the approximation of the marginal differential
entropy in (2) by higher order moments/cumulants using the Edgeworth
expansion and was derived by Comon [?]:

$$ H(Y_i) \approx -\frac{1}{48} \kappa(Y_i) = -\frac{1}{48} \big( \mu_4(Y_i) - 3 \mu_2^2(Y_i) \big), \tag{3} $$

where $\mu_k(Y_i)$ is the $k$th central moment of $Y_i$, and
$\kappa(Y_i) / \mu_2^2(Y_i)$ is called the kurtosis of $Y_i$. See also Cardoso
[?] for a nice exposition of the various approximations to the mutual
information. Now, the KMB is defined as follows:$^1$

$$ B_\kappa = \arg\min_{B \in \mathcal{D}} \mathcal{C}_\kappa(B \mid X) = \arg\max_{B \in \mathcal{D}} \sum_{i=1}^n \kappa(Y_i), \tag{4} $$

where $\mathcal{C}_\kappa(B \mid X) = -\sum_{i=1}^n \kappa(Y_i)$. We note that the
LSDB and the KMB are tightly related, yet can be different. After all, (3) is
simply an approximation to the entropy up to the fourth order cumulant. We
also would like to point out that Buckheit and Donoho [?] independently
proposed the same measure as a basis selection criterion, whose objective was
to find a basis under which an input stochastic process looks maximally
"non-Gaussian."

4 Review of Previous Results on the Simple 
Spike Process 

In this section, we briefly summarize the results on the simple spike process,
which we obtained previously. See [?] for the details and proofs.

An $n$-dimensional simple spike process generates the standard basis
vectors $\{e_j\}_{j=1}^n \subset \mathbb{R}^n$ in a random order, where $e_j$ has
one at the $j$th entry and all the other entries are zero. One can view this
process as a unit impulse located at a random position between 1 and $n$.

1 Note that there is a slight abuse of terminology: we call it the
kurtosis-maximizing basis even though we maximize the unnormalized version
of the kurtosis (without the division by $\mu_2^2(Y_i)$).

4.1 The Karhunen-Loeve Basis



The Karhunen-Loeve basis of this process is not unique and not useful
because of the following proposition.

Proposition 4.1. The Karhunen-Loeve basis for the simple spike process is
any orthonormal basis in $\mathbb{R}^n$ containing the "DC" vector
$\mathbf{1}_n = (1, 1, \ldots, 1)^T$.

This proposition reminds us of the non-Gaussianity of the simple spike process.

4.2 The Best Sparsifying Basis 

As for the BSB, we have the following result: 

Theorem 4.2. The BSB with any $p \in [0, 1]$ for the simple spike process is
the standard basis if $\mathcal{D} = O(n)$ or $SL^{\pm}(n, \mathbb{R})$.

4.3 Statistical Dependence and Entropy of the Simple 
Spike Process 

Before considering the LSDB of this process, let us note a few specifics about 
the simple spike process. First, although the standard basis is the BSB 
for this process, it clearly does not provide the statistically independent co- 
ordinates. The existence of a single spike at one location prohibits spike 
generation at other locations. This implies that these coordinates are highly 
statistically dependent. 

Second, we can compute the true entropy $H(X)$ for this process, unlike
for other, more complicated stochastic processes. Since the simple spike
process selects one possible vector from the standard basis vectors of
$\mathbb{R}^n$ with uniform probability $1/n$, the true entropy $H(X)$ is clearly
$\log n$. This is one of the rare cases where we know the true
high-dimensional entropy of the process.

4.4 The LSDB among O(n) 

For D = O(n), we have the following theorem. 
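Theorem 4.3. The LSDB among $O(n)$ is the following:

for $n \ge 5$, either the standard basis or the basis whose matrix
representation is

$$ \frac{1}{n} \begin{bmatrix} n-2 & -2 & \cdots & -2 & -2 \\ -2 & n-2 & \ddots & & -2 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ -2 & & \ddots & n-2 & -2 \\ -2 & -2 & \cdots & -2 & n-2 \end{bmatrix}; \tag{5} $$

for $n = 4$, the Walsh basis, i.e.,

$$ \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}; $$

for $n = 3$,

$$ \begin{bmatrix} 1/\sqrt{3} & 1/\sqrt{6} & 1/\sqrt{2} \\ 1/\sqrt{3} & 1/\sqrt{6} & -1/\sqrt{2} \\ 1/\sqrt{3} & -2/\sqrt{6} & 0 \end{bmatrix}; \text{ and} $$

for $n = 2$,

$$ \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, $$

and this is the only case where the true independence is achieved.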

Remark 4.4. Note that when we say the basis is a matrix as above, we
really mean that the column vectors of that matrix form the basis. This
also means that any permuted and/or sign-flipped (i.e., multiplied by $-1$)
versions of those column vectors also form the basis. Therefore, when we
say the basis is a matrix $A$, we mean not only $A$ but also its permuted and
sign-flipped versions. This remark also applies to all the propositions
and theorems below, unless stated otherwise.

Remark 4.5. There is an important geometric interpretation of (5). This
matrix can also be written as:

$$ B_{HR(n)} = I_n - 2 \, \frac{\mathbf{1}_n}{\sqrt{n}} \frac{\mathbf{1}_n^T}{\sqrt{n}}. $$

In other words, this matrix represents the Householder reflection with respect
to the hyperplane $\{ y \in \mathbb{R}^n \mid \sum_{i=1}^n y_i = 0 \}$ whose unit
normal vector is $\mathbf{1}_n / \sqrt{n}$.

Below, we use the notation $B_{O(n)}$ for the LSDB among $O(n)$ to distinguish
it from the LSDB among $GL(n, \mathbb{R})$, which is denoted by $B_{GL(n)}$. So, for
example, for $n \ge 5$, $B_{O(n)} = I_n$ or $B_{HR(n)}$.
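
The identification of (5) with a Householder reflection can be checked numerically. The sketch below builds both matrices entry by entry and compares them, and also checks orthonormality of the columns; the helper names (`householder_dc`, `matrix_eq5`, `max_abs_diff`, `gram`) are hypothetical and chosen only for this illustration.

```python
def householder_dc(n):
    """Return I_n - 2 u u^T for the unit vector u = 1_n / sqrt(n)."""
    # u u^T has every entry equal to 1/n, so no square root is needed.
    return [[(1.0 if i == j else 0.0) - 2.0 / n for j in range(n)]
            for i in range(n)]


def matrix_eq5(n):
    """Return the matrix in (5): (n - 2)/n on the diagonal, -2/n elsewhere."""
    return [[((n - 2.0) if i == j else -2.0) / n for j in range(n)]
            for i in range(n)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two square matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def gram(m):
    """Return m^T m, which is the identity when the columns are orthonormal."""
    n = len(m)
    return [[sum(m[k][i] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


n = 6
eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
print(max_abs_diff(householder_dc(n), matrix_eq5(n)))   # agreement with (5)
print(max_abs_diff(gram(householder_dc(n)), eye))       # orthonormality
```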



4.5 The LSDB among GL(n, R)

As discussed in [?], for the simple spike process, there is no important
distinction in the LSDB selection from $GL(n, \mathbb{R})$ and from
$SL^{\pm}(n, \mathbb{R})$. Therefore, we do not have to treat these two cases
separately. On the other hand, the generalized spike process in Section 5
requires us to treat $SL^{\pm}(n, \mathbb{R})$ and $GL(n, \mathbb{R})$ differently
due to the continuous amplitude of the generated spikes.
We now have the following curious theorem:

Theorem 4.6. The LSDB among $GL(n, \mathbb{R})$ with $n > 2$ is the following basis
pair (for analysis and synthesis respectively):

$$ B_{GL(n)}^{-1} = \begin{bmatrix} a & a & \cdots & \cdots & a \\ b_2 & c_2 & b_2 & \cdots & b_2 \\ b_3 & b_3 & c_3 & \cdots & b_3 \\ \vdots & \vdots & & \ddots & \vdots \\ b_n & b_n & \cdots & b_n & c_n \end{bmatrix}, \tag{6} $$

$$ B_{GL(n)} = \begin{bmatrix} \big(1 + \sum_{k=2}^n b_k d_k\big)/a & -d_2 & -d_3 & \cdots & -d_n \\ -b_2 d_2 / a & d_2 & 0 & \cdots & 0 \\ -b_3 d_3 / a & 0 & d_3 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & 0 \\ -b_n d_n / a & 0 & \cdots & 0 & d_n \end{bmatrix}, \tag{7} $$

where $a$, $b_k$, $c_k$ are arbitrary real-valued constants satisfying $a \neq 0$,
$b_k \neq c_k$, and $d_k = 1/(c_k - b_k)$, $k = 2, \ldots, n$.

If we restrict ourselves to $\mathcal{D} = SL^{\pm}(n, \mathbb{R})$, then the
parameter $a$ must satisfy:

$$ a = \pm \prod_{k=2}^n (c_k - b_k)^{-1}. $$

Remark 4.7. The LSDB such as (5) and the LSDB pair (6), (7) provide
us with further insight into the difference between sparsity and statistical
independence. In the case of (5), this is the LSDB, yet it does not sparsify
the spike process at all. In fact, these coordinates are completely dense,
i.e., $\mathcal{C}_0 = n$. We can also show that the sparsity measure
$\mathcal{C}_p$ gets worse as $n \to \infty$.
More precisely, we have the following proposition.


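Proposition 4.8.

$$ \lim_{n \to \infty} \mathcal{C}_p(B_{HR(n)} \mid X) = \begin{cases} \infty & \text{if } 0 \le p < 1; \\ 3 & \text{if } p = 1. \end{cases} $$

It is interesting to note that this LSDB approaches the standard basis
as $n \to \infty$. This also implies that

$$ \lim_{n \to \infty} \mathcal{C}_p(B_{HR(n)} \mid X) \neq \mathcal{C}_p\Big( \lim_{n \to \infty} B_{HR(n)} \,\Big|\, X \Big). $$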
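
The limits in Proposition 4.8 can be observed numerically. The sketch below assumes the closed form obtained by reading off the entries of the Householder matrix: each column has one entry $(n-2)/n$ and $n-1$ entries $-2/n$, so every realization of the simple spike process gives the cost $((n-2)/n)^p + (n-1)(2/n)^p$. The function name `hr_cost` is hypothetical.

```python
def hr_cost(n, p):
    """l^p cost of the Householder basis for the simple spike process, n >= 3.

    Every column of the matrix has one entry (n - 2)/n and n - 1 entries -2/n,
    so each realization yields the same cost.
    """
    return ((n - 2) / n) ** p + (n - 1) * (2 / n) ** p


for n in (10, 1000, 100000):
    print(n, hr_cost(n, 1.0), hr_cost(n, 0.5))
```

For $p = 1$ the printed values settle near 3, while for $p = 0.5$ they keep growing with $n$.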



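As for the analysis LSDB (6), the ability to sparsify the spike process
depends on the values of $b_k$ and $c_k$. Since the parameters $a$, $b_k$, and
$c_k$ are arbitrary as long as $a \neq 0$ and $b_k \neq c_k$, let us put $a = 1$,
$b_k = 0$, $c_k = 1$, for $k = 2, \ldots, n$. Then we get the following specific
LSDB pair:

$$ B_{GL(n)}^{-1} = \begin{bmatrix} 1 & \mathbf{1}_{n-1}^T \\ \mathbf{0}_{n-1} & I_{n-1} \end{bmatrix}, \qquad B_{GL(n)} = \begin{bmatrix} 1 & -\mathbf{1}_{n-1}^T \\ \mathbf{0}_{n-1} & I_{n-1} \end{bmatrix}. $$

This analysis LSDB provides us with a sparse representation for the simple
spike process (though this is clearly not better than the standard basis). For
$Y = B_{GL(n)}^{-1} X$,

$$ \mathcal{C}_p = E\big[ \|Y\|_p^p \big] = \frac{1}{n} \times 1 + \frac{n-1}{n} \times 2 = 2 - \frac{1}{n}, \quad 0 \le p \le 1. $$

Now, let us take $a = 1$, $b_k = 1$, $c_k = 2$ for $k = 2, \ldots, n$ in (6) and
(7). Then we get

$$ B_{GL(n)}^{-1} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 1 \\ 1 & \cdots & 1 & 2 \end{bmatrix}, \qquad B_{GL(n)} = \begin{bmatrix} n & -1 & \cdots & -1 \\ -1 & & & \\ \vdots & & I_{n-1} & \\ -1 & & & \end{bmatrix}. $$

The sparsity measure of this process is:

$$ \mathcal{C}_p = \frac{1}{n} \times n + \frac{n-1}{n} \times \{ (n-1) + 2^p \} = n + (2^p - 1)\Big( 1 - \frac{1}{n} \Big), \quad 0 \le p \le 1. $$

Therefore, the spike process under this analysis basis is completely dense,
i.e., $\mathcal{C}_p \ge n$ for $0 \le p \le 1$, and the equality holds if and only
if $p = 0$. Yet this is still the LSDB.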



Finally, from Theorems 4.3 and 4.6, we can prove the following corollary: 



Corollary 4.9. There is no invertible linear transformation providing the 
statistically independent coordinates for the spike process for n > 2. 

5 The Generalized Spike Process 



In [10], Donoho et al. analyzed the following generalization of the simple
spike process in terms of the KLB and the rate distortion function. This
process first picks one coordinate out of $n$ coordinates randomly as before,
but then the amplitude of this single spike is picked according to the
standard normal distribution $N(0, 1)$. The pdf of this process can be written
as follows:

$$ f_X(x) = \frac{1}{n} \sum_{i=1}^n \Big( \prod_{j \neq i} \delta(x_j) \Big) g(x_i), \tag{8} $$

where $\delta(\cdot)$ is the Dirac delta function, and
$g(x) = (1/\sqrt{2\pi}) \cdot \exp(-x^2/2)$, i.e., the pdf of the standard normal
distribution. Figure 1 shows this pdf for $n = 2$. Interestingly enough, this
generalized spike process shows rather different behavior (particularly in
the statistical independence) from the simple spike process in Section 4. We
also note that our proofs here are rather analytical compared to those for
the simple spike process presented in [?], which have a
more combinatorial flavor.
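
The verbal description of the process translates directly into a sampler. The following is a minimal sketch, assuming Python's standard `random` module as the source of randomness; the function name `generalized_spike` and the fixed seed are illustrative choices only.

```python
import random


def generalized_spike(n, rng):
    """Draw one realization: a N(0, 1) amplitude at a uniformly random position."""
    x = [0.0] * n
    x[rng.randrange(n)] = rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)          # fixed seed, only for reproducibility
for _ in range(3):
    print(generalized_spike(5, rng))
```

Each printed vector has a single nonzero entry, in contrast to the simple spike process, where that entry would always equal 1.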



5.1 The Karhunen-Loeve Basis 

We can easily compute the covariance matrix of this process, which is
proportional to the identity matrix. In fact, it is just $I_n / n$. Therefore,
we have the following proposition, which was also stated without proof by
Donoho et al. [10].

Figure 1: The pdf of the generalized spike process ($n = 2$).



Proposition 5.1. The Karhunen-Loeve basis for the generalized spike
process is any orthonormal basis in $\mathbb{R}^n$.

Proof. Let us first compute the marginal pdf of $X_j$. By integrating out all
$x_i$, $i \neq j$, we can easily get:

$$ f_{X_j}(x_j) = \frac{1}{n} g(x_j) + \frac{n-1}{n} \delta(x_j). $$

Therefore, we have $E[X_j] = 0$. Now, $X_i$ and $X_j$ with $i \neq j$ cannot be
simultaneously nonzero; therefore,

$$ E[X_i X_j] = \delta_{ij} E[X_j^2] = \frac{1}{n} \delta_{ij}, $$

since the Gaussian component of the marginal pdf has variance 1 and weight
$1/n$. Therefore, the covariance matrix of this process
is, as announced, $I_n / n$. Therefore, any orthonormal basis is the KLB. □
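
The covariance computed in this proof can be compared with a simulation. The sketch below estimates $E[X X^T]$ (which equals the covariance because the mean is zero) from simulated realizations; the function name `second_moment_matrix`, the sample size, and the seed are assumptions made for illustration.

```python
import random


def second_moment_matrix(n, num_samples, seed=0):
    """Estimate E[X X^T] for the generalized spike process by simulation."""
    rng = random.Random(seed)
    acc = [[0.0] * n for _ in range(n)]
    for _ in range(num_samples):
        x = [0.0] * n
        x[rng.randrange(n)] = rng.gauss(0.0, 1.0)
        for i in range(n):
            if x[i] == 0.0:
                continue            # zero entries contribute nothing
            for j in range(n):
                acc[i][j] += x[i] * x[j]
    return [[v / num_samples for v in row] for row in acc]


n = 3
for row in second_moment_matrix(n, 60000):
    print([round(v, 3) for v in row])   # compare with I_n / n
```

The off-diagonal entries are exactly zero for every sample size, since two coordinates are never nonzero at once; only the diagonal carries sampling error.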

In other words, the KLB for this process is less restrictive than that for
the simple spike process (Proposition 4.1), and the KLB is again completely
useless for this process.

5.2 Marginal distributions and moments under SL±(n, R)

Before analyzing the BSB and LSDB, we need some background work. First,
let us compute the pdf of the process relative to a transformation
$Y = B^{-1} X$, $B \in SL^{\pm}(n, \mathbb{R})$. In general, if $Y = B^{-1} X$, then

$$ f_Y(y) = \frac{1}{|\det B^{-1}|} f_X(By). $$

Therefore, from (8) and the fact that $|\det B| = 1$, we have

$$ f_Y(y) = \frac{1}{n} \sum_{i=1}^n \Big( \prod_{j \neq i} \delta(r_j^T y) \Big) g(r_i^T y), \tag{9} $$

where $r_j^T$ is the $j$th row vector of $B$. As for its marginal pdf, we have the
following lemma:


Lemma 5.2.

$$ f_{Y_j}(y) = \frac{1}{n} \sum_{i=1}^n g(y; |\Delta_{ij}|), \quad j = 1, \ldots, n, \tag{10} $$

where $\Delta_{ij}$ is the $(i, j)$th cofactor of the matrix $B$, and
$g(y; \sigma) = g(y/\sigma)/\sigma$ represents the pdf of the normal distribution
$N(0, \sigma^2)$.

In other words, one can interpret the $j$th marginal pdf as a mixture of
Gaussians with the standard deviations $|\Delta_{ij}|$, $i = 1, \ldots, n$. Figure 2
shows several marginal pdf's for $n = 2$. As one can see from this figure, it
can vary from a very spiky distribution to a usual normal distribution
depending on the rotation angle of the coordinate.

Figure 2 (plot title: "Marginal Density Function at Various Rotation Angles"):
The marginal pdf's of the generalized spike process ($n = 2$). All the
pdf's shown here are projections of the 2D pdf in Figure 1 onto the rotated
1D axis. The axis angle in the top row is 0.088 rad., which is close to the
first axis of the standard basis. The axis angle in the bottom row is $\pi/4$
rad., i.e., a 45 degree rotation, which gives rise to the exact normal
distribution. The other axis angles are equispaced between these two.

Proof. Let us rewrite (9) as

$$ f_Y(y) = \frac{1}{n} \sum_{i=1}^n \delta(r_1^T y) \cdots \delta(r_{i-1}^T y) \, \delta(r_{i+1}^T y) \cdots \delta(r_n^T y) \, g(r_i^T y). \tag{11} $$

The $j$th marginal pdf can be written as

$$ f_{Y_j}(y_j) = \int f_Y(y_1, \ldots, y_n) \, dy_1 \cdots dy_{j-1} \, dy_{j+1} \cdots dy_n. $$

Consider the $i$th term in the summation of (11) and integrate it out with
respect to $y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_n$:

$$ \int \delta(r_1^T y) \cdots \delta(r_{i-1}^T y) \, \delta(r_{i+1}^T y) \cdots \delta(r_n^T y) \, g(r_i^T y) \, dy_1 \cdots dy_{j-1} \, dy_{j+1} \cdots dy_n. \tag{12} $$

We use a change of variable formula to integrate this. Let $r_k^T y = x_k$,
$k = 1, \ldots, n$, and let $b_\ell$ be the $\ell$th column vector of $B$. The
relationship $By = x$ can be rewritten as follows:

$$ B^{(i,j)} y^{(j)} + y_j b_j^{(i)} = x^{(i)}, $$

where $B^{(i,j)}$ is the $(n-1) \times (n-1)$ matrix obtained by removing the
$i$th row and $j$th column of $B$, and the vectors with superscripts indicate
the length $n-1$ column vectors obtained by removing the elements whose
indices are specified in the parentheses. This means that

$$ y^{(j)} = (B^{(i,j)})^{-1} \big( x^{(i)} - y_j b_j^{(i)} \big). $$

Thus,

$$ dy^{(j)} = dy_1 \cdots dy_{j-1} \, dy_{j+1} \cdots dy_n = \frac{1}{|\det B^{(i,j)}|} \, dx^{(i)} = \frac{1}{|\Delta_{ij}|} \, dx_1 \cdots dx_{i-1} \, dx_{i+1} \cdots dx_n. $$

Let us now express $r_i^T y = x_i$ in terms of $y_j$ and $x^{(i)}$:

$$ \begin{aligned} r_i^T y &= (r_i^{(j)})^T y^{(j)} + b_{ij} y_j \\ &= (r_i^{(j)})^T (B^{(i,j)})^{-1} x^{(i)} + y_j \big( b_{ij} - (r_i^{(j)})^T (B^{(i,j)})^{-1} b_j^{(i)} \big) \\ &\overset{(*)}{=} (r_i^{(j)})^T (B^{(i,j)})^{-1} x^{(i)} \pm \frac{y_j}{\Delta_{ij}}, \end{aligned} \tag{13} $$

where (*) follows from the following lemma, whose proof is shown in
Appendix A:

Lemma 5.3. For any $B = (b_{ij}) \in GL(n, \mathbb{R})$,

$$ b_{ij} - (r_i^{(j)})^T (B^{(i,j)})^{-1} b_j^{(i)} = \frac{1}{\Delta_{ij}} \det B, \quad 1 \le i, j \le n. $$

Now, let us go back to the integration (12). Thanks to the property of
the delta function with Equation (13), we have

$$ \int \cdots \int \delta(x_1) \cdots \delta(x_{i-1}) \, \delta(x_{i+1}) \cdots \delta(x_n) \, g(r_i^T y) \, \frac{1}{|\Delta_{ij}|} \, dx_1 \cdots dx_{i-1} \, dx_{i+1} \cdots dx_n = \frac{1}{|\Delta_{ij}|} \, g(\pm y_j / \Delta_{ij}) = g(y_j; |\Delta_{ij}|), $$

where we used the fact that $g(\cdot)$ is an even function. Therefore, we can
write the $j$th marginal distribution as announced in (10). □
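
For $n = 2$ the mixture in (10) is easy to evaluate directly. The sketch below is restricted to $2 \times 2$ rotations and assumes all cofactors in the chosen column are nonzero; the helper names (`gauss_pdf`, `cofactors_2x2`, `marginal_pdf`, `rotation`) are hypothetical.

```python
import math


def gauss_pdf(y, sigma):
    """Density of N(0, sigma^2) at y, i.e. g(y; sigma) = g(y / sigma) / sigma."""
    return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def cofactors_2x2(b):
    """Cofactor matrix (Delta_ij) of a 2 x 2 matrix b given as two rows."""
    (b11, b12), (b21, b22) = b
    return [[b22, -b21], [-b12, b11]]


def marginal_pdf(b, j, y):
    """Equal-weight Gaussian mixture for coordinate j (0-based), as in (10).

    Assumes every cofactor in column j is nonzero.
    """
    delta = cofactors_2x2(b)
    return sum(gauss_pdf(y, abs(delta[i][j])) for i in range(2)) / 2.0


def rotation(theta):
    """2 x 2 rotation matrix by the angle theta, stored as two rows."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


for theta in (0.088, math.pi / 8, math.pi / 4):
    print(theta, marginal_pdf(rotation(theta), 0, 0.0))
```

At $\theta = \pi/4$ both standard deviations equal $1/\sqrt{2}$, so the mixture collapses to a single normal density, while at small angles one component becomes very narrow and the density at the origin becomes large.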

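Let us now compute the moments of $Y_j$, which will be used later. We use
the fact that its pdf is a mixture of $n$ Gaussians, each of which has mean 0
and variance $|\Delta_{ij}|^2$. Therefore, it is obvious that $E[Y_j] = 0$ for all
$j = 1, \ldots, n$. Now we have the following lemma for the moments.

Lemma 5.4.

$$ E[|Y_j|^p] = \frac{\Gamma(p)}{n \, 2^{p/2-1} \, \Gamma(p/2)} \sum_{i=1}^n |\Delta_{ij}|^p, \quad \text{for all } p > 0. \tag{14} $$

Proof. We have:

$$ E[|Y_j|^p] = \frac{1}{n} \sum_{i=1}^n \int_{-\infty}^{\infty} |y|^p \, g(y; |\Delta_{ij}|) \, dy = \frac{1}{n} \sum_{i=1}^n \sqrt{\frac{2}{\pi}} \, |\Delta_{ij}|^p \, \Gamma(1+p) \, D_{-1-p}(0) $$

by Gradshteyn and Ryzhik [11, Formula 3.462.1], where $D_{-1-p}(\cdot)$ is
Whittaker's function as defined by Abramowitz and Stegun [?, pp. 687]:

$$ D_{-a-1/2}(0) = U(a, 0) = \frac{\sqrt{\pi}}{2^{a/2+1/4} \, \Gamma(a/2 + 3/4)}. $$

Thus, putting $a = p + 1/2$ into the above equation yields:

$$ D_{-1-p}(0) = \frac{\sqrt{\pi}}{2^{1/2+p/2} \, \Gamma(1 + p/2)}. $$

Therefore, we have

$$ E[|Y_j|^p] = \frac{1}{n} \sum_{i=1}^n |\Delta_{ij}|^p \, \frac{\Gamma(1+p)}{2^{p/2} \, \Gamma(1 + p/2)} = \frac{1}{n} \sum_{i=1}^n |\Delta_{ij}|^p \, \frac{\Gamma(p)}{2^{p/2-1} \, \Gamma(p/2)} = \frac{\Gamma(p)}{n \, 2^{p/2-1} \, \Gamma(p/2)} \sum_{i=1}^n |\Delta_{ij}|^p, $$

as we desired. □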
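
The constant in Lemma 5.4 can be sanity-checked for a single Gaussian component, that is, for $E|Z|^p$ with $Z \sim N(0, \sigma^2)$. The sketch below compares the closed form with a plain midpoint-rule integration; the function names, the truncation at $12\sigma$, and the step count are assumptions chosen for illustration, not a tuned numerical method.

```python
import math


def abs_moment_closed_form(sigma, p):
    """E|Z|^p for Z ~ N(0, sigma^2), written as in Lemma 5.4 (one mixture term)."""
    return sigma ** p * math.gamma(p) / (2.0 ** (p / 2.0 - 1.0) * math.gamma(p / 2.0))


def abs_moment_numeric(sigma, p, steps=100000):
    """Midpoint-rule approximation of 2 * int_0^inf y^p g(y; sigma) dy."""
    upper = 12.0 * sigma                 # the Gaussian tail beyond this is negligible
    h = upper / steps
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += y ** p * math.exp(-0.5 * (y / sigma) ** 2)
    return 2.0 * norm * total * h


for p in (0.5, 1.0, 2.0, 4.0):
    print(p, abs_moment_closed_form(1.5, p), abs_moment_numeric(1.5, p))
```

For $p = 2$ and $p = 4$ the closed form reduces to the familiar values $\sigma^2$ and $3\sigma^4$.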
5.3 The Best Sparsifying Basis 

As for the BSB, after all, there is no difference between the generalized spike 
process and the simple spike process. 

Theorem 5.5. The BSB with any $p \in [0, 1]$ for the generalized spike process
is the standard basis if $\mathcal{D} = O(n)$ or $SL^{\pm}(n, \mathbb{R})$.

Proof. Let us first consider the case $p \in (0, 1]$. Then, using Lemma 5.4, the
cost function (1) can be rewritten as follows:

$$ \mathcal{C}_p(B \mid X) = \sum_{j=1}^n E[|Y_j|^p] = \frac{\Gamma(p)}{n \, 2^{p/2-1} \, \Gamma(p/2)} \sum_{i=1}^n \sum_{j=1}^n |\Delta_{ij}|^p. $$

Let us now define a matrix $\widetilde{B} = (\Delta_{ij})$. Then
$\widetilde{B} \in SL^{\pm}(n, \mathbb{R})$ since

$$ B^{-1} = \frac{1}{\det B} (\Delta_{ji}) = \pm (\Delta_{ji}), $$

and $B^{-1} \in SL^{\pm}(n, \mathbb{R})$. Therefore, this reduces to

$$ \mathcal{C}_p(B \mid X) = \frac{\Gamma(p)}{n \, 2^{p/2-1} \, \Gamma(p/2)} \sum_{i=1}^n \sum_{j=1}^n |\tilde{b}_{ij}|^p, $$

where $\tilde{b}_{ij} = \Delta_{ij}$ denotes the $(i, j)$th entry of $\widetilde{B}$.
This means that our problem now becomes the same as Theorem 1 in [?] (or
Theorem 4.2 in this paper) by replacing $B$ by $\widetilde{B}$. Thus, it asserts
that $\widetilde{B}$ must be the identity matrix $I_n$ or its permuted or
sign-flipped versions. Suppose $\Delta_{ij} = \delta_{ij}$. Then,
$B^{-1} = \pm (\Delta_{ji}) = \pm I_n$, which implies that $B = \pm I_n$.
If $(\Delta_{ij})$ is any permutation matrix, then $B^{-1}$ is just that
permutation matrix or its sign-flipped version. Therefore, $B$ is also a
permutation matrix or its sign-flipped version.

Finally, let us consider the case $p = 0$. Then, any invertible linear
transformation except the identity matrix or its permuted or sign-flipped
versions clearly increases the number of nonzero elements after the
transformation. Therefore, the BSB with $p = 0$ is also a permutation matrix or
its sign-flipped version.

This completes the proof of Theorem 5.5. □
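
To see the statement of Theorem 5.5 at work in the smallest case, the sketch below evaluates the cost obtained from Lemma 5.4 for $2 \times 2$ rotations, whose cofactor matrix coincides with the rotation itself. The function names `bsb_cost` and `rotation_cofactors` and the angle grid are illustrative assumptions, and only a one-parameter family inside $O(2)$ is scanned.

```python
import math


def bsb_cost(cofactors, p):
    """Expected l^p cost from Lemma 5.4, for 0 < p <= 1.

    `cofactors` is the matrix (Delta_ij) of the candidate basis B.
    """
    n = len(cofactors)
    const = math.gamma(p) / (n * 2.0 ** (p / 2.0 - 1.0) * math.gamma(p / 2.0))
    return const * sum(abs(d) ** p for row in cofactors for d in row)


def rotation_cofactors(theta):
    """Cofactor matrix of the 2 x 2 rotation by theta (equal to the rotation itself)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


for p in (0.5, 1.0):
    costs = [bsb_cost(rotation_cofactors(k * math.pi / 16), p) for k in range(9)]
    print(p, [round(v, 4) for v in costs])
```

The printed costs are smallest at the two ends of the grid, which correspond to the standard basis and its permuted, sign-flipped version, and largest near the 45 degree rotation.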



5.4 The LSDB/KMB among O(n) 

As for the LSDB/KMB, we can see some difference from the simple spike 
process. 

Let us now consider the more specific case of $\mathcal{D} = O(n)$. So far, we
have been unable to prove the following conjecture.

Conjecture 5.6. The LSDB among $O(n)$ is the standard basis.

The difficulty is the evaluation of the sum of the marginal entropies (2)
for the pdf's of the form (10). However, a major simplification occurs if we
consider the KMB instead of the LSDB, and we can prove the following:

Theorem 5.7. The KMB among $O(n)$ is the standard basis.

Proof. Because $E[Y_j] = 0$ and $E[Y_j^2] = \frac{1}{n} \sum_{i=1}^n \Delta_{ij}^2$
for all $j$, the fourth order central moment of $Y_j$ can be written as
$\mu_4(Y_j) = \frac{3}{n} \sum_{i=1}^n \Delta_{ij}^4$, and consequently the cost
function in (4) becomes

$$ \mathcal{C}_\kappa(B \mid X) = -\frac{3}{n} \sum_{j=1}^n \Big( \sum_{i=1}^n \Delta_{ij}^4 - \frac{1}{n} \Big( \sum_{i=1}^n \Delta_{ij}^2 \Big)^2 \Big). \tag{15} $$

Note that this is true for any $B \in SL^{\pm}(n, \mathbb{R})$. If we restrict our
basis search to $O(n)$, another major simplification occurs because we have
the following special relationship between $\Delta_{ij}$ and the matrix element
$b_{ij}$ of $B \in O(n)$:

$$ B^{-1} = \frac{1}{\det B} (\Delta_{ji}) = B^T. $$

In other words,

$$ \Delta_{ij} = (\det B) \, b_{ij} = \pm b_{ij}. $$

Therefore, we have

$$ \sum_{i=1}^n \Delta_{ij}^2 = \sum_{i=1}^n b_{ij}^2 = 1. $$

Inserting this into (15), we get the following simplified cost for
$\mathcal{D} = O(n)$:

$$ \mathcal{C}_\kappa(B \mid X) = \frac{3}{n} \Big( 1 - \sum_{i=1}^n \sum_{j=1}^n b_{ij}^4 \Big). $$

This means that the KMB can be rewritten as follows:

$$ B_\kappa = \arg\max_{B \in O(n)} \sum_{i,j} b_{ij}^4. \tag{16} $$

Let us note that the existence of the maximum is guaranteed because the set
$O(n)$ is compact and the cost function $\sum_{i,j} b_{ij}^4$ is continuous.

Now, let us consider a matrix $P = (p_{ij}) = (b_{ij}^2)$. Then, from the
orthonormality of the columns and rows of $B$, this matrix $P$ belongs to the
set of doubly stochastic matrices $\mathcal{S}(n)$. Since the doubly stochastic
matrices obtained by squaring the elements of matrices in $O(n)$ constitute a
proper subset of $\mathcal{S}(n)$, we have

$$ \max_{B \in O(n)} \sum_{i,j} b_{ij}^4 \le \max_{P \in \mathcal{S}(n)} \sum_{i,j} p_{ij}^2. $$

Now, we prove that such $P$ must be an identity matrix or its permuted
version.

$$ \max_{P \in \mathcal{S}(n)} \sum_{j=1}^n \sum_{i=1}^n p_{ij}^2 = \sum_{j=1}^n \max_{\sum_{i=1}^n p_{ij} = 1, \; p_{ij} \ge 0} \sum_{i=1}^n p_{ij}^2 = \sum_{j=1}^n 1 = n, $$

where the first equality follows from the fact that the maxima of the radius
of the sphere $\sum_i p_{ij}^2$ subject to $\sum_i p_{ij} = 1$, $p_{ij} \ge 0$ occur
only at the vertices of that simplex, i.e., $p_j = e_{\sigma(j)}$,
$j = 1, \ldots, n$, where $\sigma(\cdot)$ is a permutation of $n$ items. That is,
the column vectors of $P$ must be the standard basis vectors.
This implies that the matrix $B$ corresponding to $P = I_n$ or its permuted
version must be either $I_n$ or its permuted and/or sign-flipped version. □
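
The bound behind (16) can be probed numerically: for orthogonal $B$ the sum of the fourth powers of the entries never exceeds $n$, and it equals $n$ at the identity. The sketch below generates orthogonal matrices as products of random plane rotations, which is an illustrative construction (it does not sample uniformly from $O(n)$); the function names are hypothetical.

```python
import math
import random


def random_orthogonal(n, rng, num_rotations=60):
    """Build an orthogonal matrix as a product of random plane (Givens) rotations."""
    b = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(num_rotations):
        i, j = rng.sample(range(n), 2)
        t = rng.uniform(0.0, 2.0 * math.pi)
        c, s = math.cos(t), math.sin(t)
        for row in b:                       # rotate columns i and j of b
            row[i], row[j] = c * row[i] - s * row[j], s * row[i] + c * row[j]
    return b


def fourth_power_sum(b):
    """Objective in (16): the sum of b_ij^4 over all entries."""
    return sum(v ** 4 for row in b for v in row)


rng = random.Random(0)
n = 5
print(fourth_power_sum(random_orthogonal(n, rng, 0)))       # identity matrix
print([round(fourth_power_sum(random_orthogonal(n, rng)), 4) for _ in range(5)])
```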

5.5 The LSDB/KMB among SL±(n, R)

If we extend our search to this more general case, we have the following
theorem.

Theorem 5.8. The KMB among $SL^{\pm}(n, \mathbb{R})$ does not exist.

Proof. The set $SL^{\pm}(n, \mathbb{R})$ is not compact. Therefore, there is no
guarantee that the cost function $\mathcal{C}_\kappa(B \mid X)$ has a minimum value
on this set. One can in fact consider a simple counterexample,
$B = \mathrm{diag}(a, a^{-1}, 1, \ldots, 1)$, where $a$ is any nonzero real scalar.
Then, one can show from (15) that

$$ \mathcal{C}_\kappa(B \mid X) = -\frac{3(n-1)}{n^2} \big( a^4 + a^{-4} + n - 2 \big), $$

which tends to $-\infty$ as $a \uparrow \infty$. □

As for the LSDB, we do not know whether the LSDB exists among
$SL^{\pm}(n, \mathbb{R})$ at this point, although we believe that the LSDB is the
standard basis (or its permuted/sign-flipped versions). The negative result
for the KMB does not imply a negative result for the LSDB.
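
The unboundedness used in the proof of Theorem 5.8 can be illustrated with a few numbers. The sketch below evaluates the cost for $B = \mathrm{diag}(a, a^{-1}, 1, \ldots, 1)$ from its cofactor matrix, using the fourth-moment expression for the cost written out in the docstring (this is our reading of (15)); the function names `kappa_cost` and `diag_cofactors` are hypothetical.

```python
def kappa_cost(cofactors):
    """Cost (15): -(3/n) * sum_j [ sum_i D_ij^4 - (1/n) (sum_i D_ij^2)^2 ]."""
    n = len(cofactors)
    total = 0.0
    for j in range(n):
        s2 = sum(cofactors[i][j] ** 2 for i in range(n))
        s4 = sum(cofactors[i][j] ** 4 for i in range(n))
        total += s4 - s2 * s2 / n
    return -3.0 * total / n


def diag_cofactors(a, n):
    """Cofactor matrix of B = diag(a, 1/a, 1, ..., 1), for which det B = 1."""
    d = [1.0 / a, a] + [1.0] * (n - 2)
    return [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


n = 4
for a in (1.0, 2.0, 4.0, 8.0):
    print(a, kappa_cost(diag_cofactors(a, n)))
```

The printed costs decrease without bound as $a$ grows, in proportion to $a^4 + a^{-4} + n - 2$, so no minimizer exists along this family.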



6 Discussion 

Unlike the simple spike process, the BSB and the KMB (an alternative to
the LSDB) select the standard basis if we restrict our basis search within
O(n). If we extend our basis search to $SL^{\pm}(n, \mathbb{R})$, then the BSB
exists and is again the standard basis whereas the KMB does not exist.

Although the generalized spike process is a simple stochastic process, we
have the following important interpretation. Consider a stochastic process
that, at each time, generates a basis vector randomly selected from some
fixed orthonormal basis and multiplied by a scalar drawn from the standard
normal distribution. Then that basis itself is both the BSB and the KMB among
O(n). Theorems 5.5 and 5.7 claim that once we transform the data to the
generalized spikes, one cannot do any better than that in either sparsity or
independence within O(n). Of course, if one extends the search to nonlinear
transformations, then it becomes a different story. We refer the reader to our
recent articles [14], [15] for the details of a nonlinear algorithm.
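
The interpretation above can be made concrete with a short simulation. The following is a minimal sketch in which a random orthogonal matrix `Q` stands in for the fixed orthonormal basis and the spike location is assumed to be uniform; the function name `generalized_spikes`, the seed, and the sizes are hypothetical choices for illustration.

```python
import numpy as np

def generalized_spikes(n, num, rng):
    """Each row holds one N(0,1) amplitude at a uniformly random location, zeros elsewhere."""
    X = np.zeros((num, n))
    X[np.arange(num), rng.integers(0, n, size=num)] = rng.standard_normal(num)
    return X

rng = np.random.default_rng(0)  # hypothetical seed
n, num = 8, 1000

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # stand-in for the fixed orthonormal basis
S = generalized_spikes(n, num, rng)

data = S @ Q.T     # each row is (amplitude) * (a randomly chosen column of Q)
coeffs = data @ Q  # coefficients with respect to the same basis, i.e. Q^T x per row

# Expanding the data in the generating basis returns the generalized spikes.
nonzeros = np.sum(np.abs(coeffs) > 1e-10, axis=1)
print(nonzeros.max(), np.allclose(coeffs, S))
```

In the generating basis every realization has at most one nonzero coefficient, which is the sense in which the data are transformed back to generalized spikes.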

The results of this paper further support our conclusion of the previous
paper: dealing with the BSB is much simpler than the LSDB. To deal with 
statistical dependency, we need to consider the probability law of the under- 
lying process (e.g., entropy or the marginal pdf's) explicitly. That is why we 
need to consider the KMB instead of the LSDB to prove the theorems. Also 
in practice, given a finite set of training data, it is a nontrivial task to reliably 
estimate the marginal pdf's. Moreover, the LSDB unfortunately cannot tell 
how close it is to the true statistical independence; it can only tell that it is 
the best one (i.e., the closest one to the statistical independence) among the 
given set of possible bases. In order to quantify the absolute statistical
dependence, we need to estimate the true high-dimensional entropy of the
original process, H(X), which is an extremely difficult task in general. We would like
to note, however, a recent attempt to estimate the high-dimensional entropy 
of the process by Hero and Michel [O, which uses the minimum spanning 
trees of the input data and does not require estimating the pdf of the process.
We feel that techniques of this type will help assess the absolute statistical
dependence of the process under the LSDB coordinates. Another interesting
observation is that the KMB is rather sensitive to the orthonormality of the 
basis dictionary whereas the BSB is insensitive to that. Our previous results 
on the simple spike process (e.g., Theorems 4.3 and 4.6) also suggest the
sensitivity of the LSDB to the orthonormality of the basis dictionary. This may
restrict us and discourage us from developing a new basis or a new basis
dictionary that optimizes the statistical independence.

On the other hand, the sparsity criterion neither requires estimating the
marginal pdf's nor shows sensitivity to the orthonormality. Simply
computing the expected $\ell^p$ norms suffices. Moreover, one can even adapt
the BSB to each realization rather than to the whole set of realizations, which is
impossible for the LSDB, as we discussed in [0], P2"| , f2"T |. 
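
To illustrate how little is needed to evaluate the sparsity criterion, the sketch below estimates the expected $\ell^p$ cost by a sample mean, once in the standard basis and once in a randomly rotated orthonormal basis. It assumes the cost has the form $E \sum_j |Y_j|^p$ with $p = 1$ and a uniform spike location; the function name `mean_lp_cost`, the seed, and the sizes are hypothetical.

```python
import numpy as np

def mean_lp_cost(Y, p):
    """Sample mean of sum_j |y_j|^p over the rows of Y (assumed form of the sparsity cost)."""
    return float(np.mean(np.sum(np.abs(Y) ** p, axis=1)))

rng = np.random.default_rng(0)  # hypothetical seed
n, num, p = 8, 2000, 1.0

# Realizations of the generalized spike process, one per row.
X = np.zeros((num, n))
X[np.arange(num), rng.integers(0, n, size=num)] = rng.standard_normal(num)

# Coefficients in a randomly rotated orthonormal basis.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
cost_standard = mean_lp_cost(X, p)
cost_rotated = mean_lp_cost(X @ Q, p)
print(cost_standard, cost_rotated)
```

No density estimate enters the computation: only the coefficients themselves are used, and the standard basis should give the smaller value because a unit vector has $\ell^1$ norm at least 1.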



These observations, therefore, suggest that the pursuit of sparse
representations should be encouraged rather than that of statistically
independent representations, if we believe that mammalian vision systems
evolved and developed according to the principle of data compression. This is also the view
point indicated by Donoho ||. 

Finally, there are a few interesting generalizations of the spike processes,
which need to be addressed in the near future. We need to consider a
stochastic process that randomly throws multiple spikes into a single
realization. If one throws more and more spikes into one realization, the
standard basis becomes worse in terms of sparsity. Also, we can consider
various rules for throwing in multiple spikes. For example, for each
realization, we can select the locations of the spikes statistically
independently. This is the simplest multiple spike process. Alternatively, we
can consider a certain dependence in choosing the locations of the spikes.
The ramp process of Yves Meyer analyzed by the wavelet basis is such an
example; each realization of the ramp process generates a small number of
spikes in the wavelet coefficients in the locations determined by the
location of the discontinuity of the process. See [16], [22] for more about
the ramp process.
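
The simplest multiple spike process described above can be simulated as follows. This is a minimal sketch under one possible reading of "independently": locations are drawn with replacement, so two spikes may fall on the same coordinate and add up. The function name `multiple_spikes`, the standard normal amplitudes, the seed, and the sizes are assumptions of this example.

```python
import numpy as np

def multiple_spikes(n, k, num, rng):
    """Each row gets k spikes with N(0,1) amplitudes at independently drawn locations.

    Locations are drawn with replacement, so coinciding spikes are summed.
    """
    X = np.zeros((num, n))
    rows = np.repeat(np.arange(num), k)
    cols = rng.integers(0, n, size=num * k)
    np.add.at(X, (rows, cols), rng.standard_normal(num * k))  # accumulate repeated indices
    return X

rng = np.random.default_rng(0)  # hypothetical seed
n, num = 32, 2000
for k in [1, 4, 16, 64]:
    X = multiple_spikes(n, k, num, rng)
    # Average number of nonzero coordinates per realization in the standard basis.
    print(k, np.mean(np.count_nonzero(X, axis=1)))
```

The average number of nonzero coordinates grows with the number of spikes per realization, which is the loss of sparsity in the standard basis mentioned in the text.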



Except under very special circumstances, it would be extremely difficult to
find the BSB of a complicated stochastic process (e.g., natural scene images)
that truly converts its realizations to the spike process. More likely, a
theoretically and computationally feasible basis that sparsifies the realizations of a
complicated process well (e.g., curvelets for the natural scene images ||) may 
generate expansion coefficients that may be viewed as an amplitude-varying
multiple spike process. In order to tackle this scenario, we certainly need
to: 1) identify interesting, useful, and simple enough specific stochastic
processes; 2) develop the BSB adapted to such specific processes; and 3) deepen
our understanding of the amplitude-varying multiple spike process.

A Proof of Lemma 5.3

Proof. Let us consider the following system of linear equations:

$$B^{(i,j)} z^{(j)} = b_j^{(i)},$$

where $z^{(j)} = (z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_n)^T \in \mathbb{R}^{n-1}$,
$j = 1, \ldots, n$. Using Cramer's rule (e.g., [13, pp.21]), we have, for
$k = 1, \ldots, j-1, j+1, \ldots, n$,



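$$z_k^{(j)} = \frac{1}{\det B^{(i,j)}} \det\bigl[\, b_1^{(i)} \,\cdots\, \underbrace{b_j^{(i)}}_{k\text{th}} \,\cdots\, b_n^{(i)} \,\bigr] \overset{(a)}{=} (-1)^{|k-j|-1} \frac{\det B^{(i,k)}}{\det B^{(i,j)}} \overset{(b)}{=} (-1)^{|k-j|-1} \frac{\Delta_{ik}/(-1)^{i+k}}{\Delta_{ij}/(-1)^{i+j}} = -\frac{\Delta_{ik}}{\Delta_{ij}},$$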



where (a) follows from the $(|k - j| - 1)$ column permutations to move
$b_j^{(i)}$ located at the $k$th column to the $j$th column of $B^{(i,k)}$,
and (b) follows from the definition of the cofactor. Hence,



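$$b_{ij} - \bigl(r_i^{(j)}\bigr)^T z^{(j)} = b_{ij} + \frac{1}{\Delta_{ij}} \sum_{k \ne j} b_{ik} \Delta_{ik} = \frac{1}{\Delta_{ij}} \sum_{k=1}^{n} b_{ik} \Delta_{ik} = \frac{\det B}{\Delta_{ij}}.$$

This completes the proof of Lemma 5.3. □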
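
The identity established in this appendix can be checked numerically. The sketch below reads it as follows: $b_{ij}$ minus the product of row $i$ of $B$ (without its $j$th entry) with the solution of the reduced system equals $\det B$ divided by the cofactor $\Delta_{ij}$. Since the statement of the lemma is not reproduced in this appendix, that reading, the 0-based indices, and the helper name `lemma_sides` are assumptions of this example.

```python
import numpy as np

def lemma_sides(B, i, j):
    """Return (lhs, rhs) of b_ij - r^T z = det(B) / Delta_ij, with 0-based i, j."""
    minor = np.delete(np.delete(B, i, axis=0), j, axis=1)  # B without row i and column j
    r = np.delete(B[i, :], j)                              # row i without its j-th entry
    c = np.delete(B[:, j], i)                              # column j without its i-th entry
    z = np.linalg.solve(minor, c)                          # solves minor @ z = c
    cofactor = (-1) ** (i + j) * np.linalg.det(minor)      # cofactor Delta_ij
    return B[i, j] - r @ z, np.linalg.det(B) / cofactor

rng = np.random.default_rng(0)  # hypothetical seed
B = rng.standard_normal((5, 5))
for i, j in [(0, 0), (1, 3), (4, 2)]:
    lhs, rhs = lemma_sides(B, i, j)
    print(i, j, lhs, rhs)
```

The two printed values should agree up to floating-point error for each index pair, provided the reduced matrix is nonsingular, which holds almost surely for a Gaussian random matrix.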